[nvbugs/5368410][fix] Disable moe allreduce for multi node #5918

yizhang-nv · 2025-07-10T12:48:49Z

PR title

MoE allreduce uses IPC memory, which cannot be used on a multi-node system. This PR disables the finalize MoE allreduce kernel on systems that do not support P2P.
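As a rough illustration of the gating described above, the decision might look like the sketch below. This is not the code changed in this PR: `ParallelLayout`, `spans_multiple_nodes`, and `use_finalize_moe_allreduce` are hypothetical names, and the `p2p_supported` flag is assumed to come from whatever peer-access check the runtime already performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical description of how allreduce ranks map onto nodes."""
    tp_size: int        # number of ranks taking part in the allreduce
    gpus_per_node: int  # GPUs available on a single node


def spans_multiple_nodes(layout: ParallelLayout) -> bool:
    """True when the allreduce group cannot fit on one node."""
    return layout.tp_size > layout.gpus_per_node


def use_finalize_moe_allreduce(layout: ParallelLayout, p2p_supported: bool) -> bool:
    """Decide whether the fused finalize MoE allreduce path may be used.

    The fused kernel relies on IPC memory, so it is only enabled when every
    rank is on the same node and peer-to-peer access is available. Otherwise
    the caller is assumed to fall back to a separate finalize step followed
    by a regular allreduce.
    """
    if layout.tp_size <= 1:
        return False  # single rank: there is nothing to reduce
    if spans_multiple_nodes(layout):
        return False  # IPC memory cannot be shared across nodes
    return p2p_supported


# Example: 8 ranks on one 8-GPU node versus 8 ranks across two 4-GPU nodes.
single_node = ParallelLayout(tp_size=8, gpus_per_node=8)
two_nodes = ParallelLayout(tp_size=8, gpus_per_node=4)
print(use_finalize_moe_allreduce(single_node, p2p_supported=True))
print(use_finalize_moe_allreduce(two_nodes, p2p_supported=True))
```

Keeping the check in one predicate means the fallback path is chosen in a single place, rather than each call site having to know about IPC memory limits.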

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

yizhang-nv · 2025-07-10T12:49:24Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-1,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-2,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-3,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-4"

tensorrt-cicd · 2025-07-10T12:55:20Z

PR_Github #11551 [ run ] triggered by Bot

yizhang-nv · 2025-07-10T14:45:31Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-1,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-2,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-3,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-4"

tensorrt-cicd · 2025-07-10T14:51:23Z

PR_Github #11558 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-10T14:51:25Z

PR_Github #11551 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-07-10T16:38:57Z

PR_Github #11558 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #221 (Partly Tested) completed with status: 'FAILURE'

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

yizhang-nv · 2025-07-11T01:15:52Z

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-1,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-2,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-3,GB200-8_GPUs-2_Nodes-PyTorch-[Post-Merge]-4"

tensorrt-cicd · 2025-07-11T01:21:27Z

PR_Github #11582 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-11T04:47:48Z

PR_Github #11582 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #223 (Partly Tested) completed with status: 'SUCCESS'

yizhang-nv · 2025-07-11T05:58:08Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2025-07-11T06:03:49Z

PR_Github #11611 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-11T08:16:41Z

PR_Github #11611 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #224 completed with status: 'FAILURE'

yizhang-nv · 2025-07-11T08:18:00Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2025-07-11T08:23:11Z

PR_Github #11639 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-11T09:22:13Z

PR_Github #11639 [ run ] completed with state FAILURE

yizhang-nv · 2025-07-11T22:30:45Z

/bot run --add-multi-gpu-test

tensorrt-cicd · 2025-07-11T22:38:54Z

PR_Github #11679 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-12T05:17:13Z

PR_Github #11679 [ run ] completed with state SUCCESS
/LLM/release-0.21/L0_MergeRequest_PR pipeline #235 completed with status: 'SUCCESS'

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com> Signed-off-by: Ransiki Zhang <ransikiz@nvidia.com>

yizhang-nv requested review from a team as code owners July 10, 2025 12:48

yizhang-nv requested review from pcastonguay and suyoggupta July 10, 2025 12:48

yizhang-nv force-pushed the war-moe-ar branch from 71a940d to bca12f1 Compare July 10, 2025 14:44

yizhang-nv changed the title ~~fix: Disable moe allreduce for multi node~~ [nvbugs/5368410][fix] Disable moe allreduce for multi node Jul 10, 2025

yizhang-nv force-pushed the war-moe-ar branch from bca12f1 to 5767444 Compare July 11, 2025 01:15

fix: Disable moe allreduce for multi node

31f8a0a

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

yizhang-nv force-pushed the war-moe-ar branch from 5767444 to 31f8a0a Compare July 11, 2025 01:15

litaotju approved these changes Jul 11, 2025

yizhang-nv merged commit 332a65b into NVIDIA:release/0.21 Jul 14, 2025
3 checks passed

Wanli-Jiang mentioned this pull request Jul 17, 2025

Rebase 0.21-NIM to use changes from 0.21 branch #6117

Closed

dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 21, 2025

[nvbugs/5368410][fix] Disable moe allreduce for multi node (NVIDIA#5918)

9778bcc

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 21, 2025

[nvbugs/5368410][fix] Disable moe allreduce for multi node (NVIDIA#5918)

286079a

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

dc3671 pushed a commit to dc3671/TensorRT-LLM that referenced this pull request Jul 22, 2025

[nvbugs/5368410][fix] Disable moe allreduce for multi node (NVIDIA#5918)

9a690f2

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>

dc3671 pushed a commit that referenced this pull request Jul 22, 2025

[nvbugs/5368410][fix] Disable moe allreduce for multi node (#5918)

eb7d0f8

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
